Data Visualization Project 03

PART 1: Density Plots

Using the dataset obtained from FSU’s Florida Climate Center for the station at Tampa International Airport (TPA) in 2022, attempt to recreate the charts shown below, which were generated using data from 2016. You can read the 2022 dataset using the code below:

library(tidyverse)
weather_tpa <- read_csv("https://raw.githubusercontent.com/reisanar/datasets/master/tpa_weather_2022.csv")
# random sample 
sample_n(weather_tpa, 4)
## # A tibble: 4 × 7
##    year month   day precipitation max_temp min_temp ave_temp
##   <dbl> <dbl> <dbl>         <dbl>    <dbl>    <dbl>    <dbl>
## 1  2022     1    28       0.00001       67       54     60.5
## 2  2022     9    24       0             92       76     84  
## 3  2022     5    29       0             92       72     82  
## 4  2022     5    17       0             89       75     82
summary(weather_tpa)
##       year          month             day        precipitation   
##  Min.   :2022   Min.   : 1.000   Min.   : 1.00   Min.   :0.0000  
##  1st Qu.:2022   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:0.0000  
##  Median :2022   Median : 7.000   Median :16.00   Median :0.0000  
##  Mean   :2022   Mean   : 6.526   Mean   :15.72   Mean   :0.1697  
##  3rd Qu.:2022   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:0.0300  
##  Max.   :2022   Max.   :12.000   Max.   :31.00   Max.   :2.8600  
##     max_temp        min_temp        ave_temp    
##  Min.   :45.00   Min.   :31.00   Min.   :38.00  
##  1st Qu.:80.00   1st Qu.:63.00   1st Qu.:71.00  
##  Median :87.00   Median :70.00   Median :78.00  
##  Mean   :84.54   Mean   :68.21   Mean   :76.37  
##  3rd Qu.:92.00   3rd Qu.:77.00   3rd Qu.:84.00  
##  Max.   :98.00   Max.   :83.00   Max.   :89.50

See https://www.reisanar.com/slides/relationships-models#10 for a reminder on how to use this type of dataset with the lubridate package for dates and times (example included in the slides uses data from 2016).

Using the 2022 data:

  1. Create a plot like the one below:

Hint: the option binwidth = 3 was used with the geom_histogram() function.

A

library(ggplot2)
library(RColorBrewer)
library(lubridate)
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyverse)
weather_tpa <- read_csv("https://raw.githubusercontent.com/reisanar/datasets/master/tpa_weather_2022.csv")
## Rows: 365 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (7): year, month, day, precipitation, max_temp, min_temp, ave_temp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sample_n(weather_tpa, 4)
## # A tibble: 4 × 7
##    year month   day precipitation max_temp min_temp ave_temp
##   <dbl> <dbl> <dbl>         <dbl>    <dbl>    <dbl>    <dbl>
## 1  2022     1    14             0       72       55     63.5
## 2  2022     9    29             0       77       65     71  
## 3  2022    11     7             0       87       70     78.5
## 4  2022    12    25             0       46       31     38.5
tpa_clean <- weather_tpa %>% 
  unite("doy", year, month, day, sep="-") %>% 
  mutate(date=ymd(doy), 
         max_temp=as.double(max_temp), 
         min_temp=as.double(min_temp))
tpa_months <- tpa_clean %>% 
  mutate(month_num=month(date), 
    month_abb=month(date, label=TRUE), 
    month_name=month(date, label=TRUE, abbr=FALSE))
tpa_months %>% 
  # fill is mapped once here; na.rm and position are not aesthetics
  ggplot(aes(x = max_temp, fill = month_name)) +
  geom_histogram(binwidth = 3, color = "white") +
  scale_fill_viridis_d() +
  labs(x = "Max Temperature", y = "Number of Days") +
  facet_wrap(~ month_name, ncol = 4) +
  # No xlim()/ylim(): xlim(c(60, 90)) discarded the 116 days outside 60-90
  theme_bw() +
  theme(legend.position = "none") 

  2. Create a plot like the one below:

B

tpa_clean %>% 
  ggplot(aes(x = max_temp)) +
  # fill is a constant here, so no fill scale is needed
  geom_density(bw = 0.5, fill = "darkgray") +
  # No xlim(): xlim(c(60, 90)) discarded the 116 days outside 60-90
  scale_y_continuous(breaks = seq(0, 0.08, by = 0.02)) +
  labs(x = "Maximum temperature") +
  theme_bw()

Hint: check the kernel parameter of the geom_density() function, and use bw = 0.5.
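
To get a feel for what the kernel and bw options control, here is a minimal sketch using base R's density() function, which accepts arguments with the same names. The temperatures are synthetic (not the TPA data), and "epanechnikov" is only one of the documented kernel choices, picked for illustration; it is an assumption, not necessarily the kernel used in the target chart.

```r
# Sketch only: synthetic temperatures, not the TPA data.
set.seed(42)
temps <- rnorm(365, mean = 85, sd = 8)

# Same bandwidth, two different kernels.
d_gauss <- density(temps, bw = 0.5, kernel = "gaussian")
d_epan  <- density(temps, bw = 0.5, kernel = "epanechnikov")

# A much larger bandwidth smooths the small wiggles away.
d_smooth <- density(temps, bw = 5, kernel = "gaussian")

# Each estimate should integrate to roughly 1 over its evaluation grid.
area <- function(d) sum(d$y) * diff(d$x[1:2])
print(c(gaussian = area(d_gauss), epanechnikov = area(d_epan)))

# Count local maxima as a crude measure of how wiggly the curve is.
peaks <- function(d) sum(diff(sign(diff(d$y))) == -2)
print(c(bw_0.5 = peaks(d_gauss), bw_5 = peaks(d_smooth)))
```

A small bandwidth such as 0.5 follows individual days closely, so the curve shows many narrow bumps; the kernel mainly changes the shape of each bump.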

  3. Create a plot like the one below:

Hint: default options for geom_density() were used.

C

tpa_months %>% 
  # fill is mapped once here; na.rm and position are not aesthetics
  ggplot(aes(x = max_temp, fill = month_name)) +
  # Default geom_density() options, per the hint (binwidth does not apply)
  geom_density(color = "black", alpha = 0.5) +
  scale_fill_viridis_d() +
  labs(title = "Density plot for each month in 2022", x = "Maximum temperatures", y = "") +
  facet_wrap(~ month_name, ncol = 4) +
  # No xlim()/ylim(): xlim(c(60, 90)) discarded the 116 days outside 60-90
  theme_bw() +
  theme(legend.position = "none") 

  4. Generate a plot like the chart below:

Hint: use the {ggridges} package and the geom_density_ridges() function, paying close attention to the quantile_lines and quantiles parameters. The plot above uses the plasma option (color scale) for the viridis palette.

D

library(ggridges)
tpa_months %>% 
  ggplot(aes(x=max_temp, y=month_name, fill=after_stat(x))) +
  stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE,
                      quantiles = 2, quantile_lines = TRUE) +
  labs(x="Maximum temperature (in Fahrenheit degrees)", y="") +
  scale_fill_viridis_c(name = "", option = "C") +
  theme_minimal()
## Picking joint bandwidth of 1.93
## Warning: Using the `size` aesthietic with geom_segment was deprecated in ggplot2 3.4.0.
## ℹ Please use the `linewidth` aesthetic instead.
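
As a rough check on what quantiles = 2 draws in the ridgeline plot above: my understanding is that an integer k splits each group into k equal-probability pieces, so k = 2 gives a single line at the median. The sketch below reproduces that arithmetic in base R on made-up data; the helper probs_for() is hypothetical and is not part of {ggridges}.

```r
# Sketch only: made-up temperatures for two months.
set.seed(1)
toy <- data.frame(
  month    = rep(c("January", "July"), each = 31),
  max_temp = c(rnorm(31, mean = 70, sd = 6), rnorm(31, mean = 92, sd = 2))
)

# Hypothetical helper: cut probabilities for k equal quantile groups.
probs_for <- function(k) seq_len(k - 1) / k

print(probs_for(2))  # a single cut at 0.5, i.e. the median
print(probs_for(4))  # quartile cuts

# Position of the quantile line in each month when quantiles = 2.
cuts <- sapply(
  split(toy$max_temp, toy$month),
  function(x) unname(quantile(x, probs = probs_for(2)))
)
print(cuts)
```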

  5. Create a plot of your choice that uses the attribute for precipitation (values of -99.9 for temperature or -99.99 for precipitation represent missing data).
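
The sentinel values mentioned above need to be turned into NA before plotting or summarizing, otherwise they drag averages down. A minimal base R sketch on a made-up data frame follows; the column names mirror the dataset, but the rows are invented, and the threshold of -99 is an assumption that works because no real temperature or rainfall value is that low.

```r
# Sketch only: invented rows using the dataset's column names.
toy <- data.frame(
  precipitation = c(0, 0.25, -99.99, 1.10),
  max_temp      = c(88, -99.9, 91, 79)
)

# Recode the missing-data sentinels to NA.
# A threshold avoids exact floating-point comparison with -99.9 / -99.99.
toy$precipitation[toy$precipitation < -99] <- NA
toy$max_temp[toy$max_temp < -99] <- NA

# One missing value per column after recoding.
print(colSums(is.na(toy)))

# Summaries must now skip the NAs explicitly.
print(mean(toy$precipitation, na.rm = TRUE))
```

In the 2022 file, the summary() output shows minimums of 0 for precipitation and 45 for max_temp, so no sentinel values appear to be present, but the recoding step is cheap insurance.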

E

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
int_plot <- tpa_months %>% 
  ggplot(aes(x=max_temp, y=precipitation, color=month_name)) +
  geom_point() +
  labs(x="Maximum temperature", y="Precipitation", title="Precipitation vs. maximum temperature", subtitle="Tampa International Airport - 2022", color="") +
  theme_bw()
interactive_plot <- ggplotly(int_plot)
interactive_plot

For this plot, I chose an interactive visualization, since those are my favorite.

As someone who grew up in Florida, I can confirm that we get more rain during the warmer seasons, especially this year. Coloring the points by month also shows when the cold and warm fronts normally arrive in the fall and spring.

htmlwidgets::saveWidget(interactive_plot, "precipitation.html")

PART 2

You can choose to work on either Option (A) or Option (B). Remove from this template the option you decided not to work on.

Option (A): Visualizing Text Data

Review the set of slides (and additional resources linked in it) for visualizing text data: https://www.reisanar.com/slides/text-viz#1

Choose any dataset with text data, and create at least one visualization with it. For example, you can create a frequency count of most used bigrams, a sentiment analysis of the text data, a network visualization of terms commonly used together, and/or a visualization of a topic modeling approach to the problem of identifying words/documents associated to different topics in the text data you decide to use.
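
As one illustration of the first idea above, a bigram frequency count can be sketched in base R without any extra packages. The titles below are invented for the example, and bigrams_in() is a hypothetical helper; a package such as tidytext offers its own tokenizers for this, which this sketch does not try to reproduce.

```r
# Sketch only: invented titles, not from any real dataset.
titles <- c(
  "Florida Poly opens new research lab",
  "New research lab hosts robotics event",
  "Students win robotics event"
)

# Hypothetical helper: split one string into lowercase words,
# then pair each word with the word that follows it.
bigrams_in <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Bigrams are built per title, so they never span two titles.
all_bigrams   <- unlist(lapply(titles, bigrams_in))
bigram_counts <- sort(table(all_bigrams), decreasing = TRUE)

print(head(bigram_counts, 3))
```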

Make sure to include a copy of the dataset in the data/ folder, and reference your sources if different from the ones listed below:

(to get the “raw” data from any of the links listed above, simply click on the raw button of the GitHub page and copy the URL to be able to read it in your computer using the read_csv() function)

library(tidytext)

Florida Poly news articles are something I frequently work with at my job, so this dataset combined well with my interest in sentiment analysis. I wanted to visualize the sentiment of words across a negative-to-positive range, so I used the AFINN lexicon.

poly_news <- read.csv("https://raw.githubusercontent.com/reisanar/datasets/master/flpoly_news_SP23.csv")
poly_sentiment <- poly_news %>% 
  unnest_tokens(word, news_title) %>%
  mutate(word_count = 1:n(),
         index = word_count %/% 500 + 1) %>% 
  # Naming the join key explicitly avoids the "Joining, by" message
  inner_join(get_sentiments("afinn"), by = "word") 
poly_months <- poly_sentiment %>% 
  mutate(month_num=month(news_date), 
    month_abb=month(news_date, label=TRUE), 
    month_name=month(news_date, label=TRUE, abbr=FALSE))

I tokenized the titles into individual words and kept only the words that have an AFINN sentiment score. I then derived the month number, abbreviation, and full month name from each article's date. Finally, I passed the result to ggplot and created side-by-side monthly panels to show how sentiment varies from month to month.
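
The tokenize, join, and label-by-month steps can be sketched in base R on a tiny example. Everything here is made up for illustration: the titles, the dates, and the mini-lexicon, which only imitates the word/value layout of AFINN and should not be read as its real scores. The column names news_date and news_title follow the dataset.

```r
# Sketch only: invented titles and an invented word/value lexicon.
toy_lexicon <- data.frame(
  word  = c("win", "award", "great", "delay", "loss"),
  value = c(4, 3, 3, -1, -3)
)
news <- data.frame(
  news_date  = as.Date(c("2023-01-10", "2023-01-24", "2023-02-07")),
  news_title = c("Students win great award",
                 "Construction delay announced",
                 "Team takes a loss but will win next")
)

# One row per word, keeping the date of the title it came from.
tokens <- do.call(rbind, lapply(seq_len(nrow(news)), function(i) {
  words <- strsplit(tolower(news$news_title[i]), "[^a-z']+")[[1]]
  data.frame(news_date = news$news_date[i], word = words[nzchar(words)])
}))

# Inner join: words without a lexicon entry are dropped.
scored <- merge(tokens, toy_lexicon, by = "word")

# Full month name from the date (locale-independent via month.name).
scored$month_name <- month.name[as.integer(format(scored$news_date, "%m"))]

monthly <- aggregate(value ~ month_name, data = scored, FUN = sum)
print(monthly)
```

Because the join is an inner join, most words disappear at that step, so the monthly panels count only scored words rather than all words.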

poly_months %>% 
  ggplot(aes(x=value, fill=month_name)) +
  geom_bar() +
  guides(fill = "none") +  
  labs(x = "Sentiment Value", y = "Count", title="Sentiment analysis on Florida Polytechnic University news across each month") +
  facet_wrap(~ month_name, ncol = 4) +
  theme_gray() +
  scale_fill_viridis_d() +
  # Pad the limits by half a unit so bars at -5 and 5 are not dropped
  scale_x_continuous(limits = c(-5.5, 5.5))

- What I infer from this visualization is that January, February, September, and October frequently spike near the 2.5 mark on the sentiment axis (AFINN scores are integers, so the spike sits at a score of 2 or 3). These are early months of the Spring and Fall semesters, which suggests that positive news is common at the beginning of the main semesters. I expected a spike in May for graduation, but that may instead show up earlier, in April.

- I used the viridis palette to keep the report visually consistent, and February happens to land on a Poly color, so I figured it was fitting as well. I think these monthly side-by-side plots are a good way to show data across time without making the graph confusing.